INTRODUCTION


One of the most fundamental problems of software engineering today, both
in theory and in practice, is the prediction of software length. Many
studies have indicated that the length of a program is consistently correlated
with other complexity measurements of the program's characteristics [Basili 83],
[Mata 84], [Jensen 85], [Ivan 87]. In addition, a recent survey on software
economics has listed software length estimation as the first major issue needing
further research [Bohem 84]. Software length can be measured in a
number of ways, such as lines of source code [Dijkstra 72], executable
statements [Curtis 79], or the total number of tokens in the program [Halstead 72].
In 1972, Halstead proposed a simple formula for predicting the length of a
program based on the number of unique operators and operands used by the
program [Halstead 72]. Christensen has stated the general advantages of this
operator/operand approach as [Chris 81]:

•  An  explainable  methodology  for  calibrating  a  measurement  instrument. 

•  A  more  nearly  universal  measure,  since  the  approach  is  consistent  across 
the  boundaries  of  programming  languages. 

•  The ability to relate some of the effects of programming style to
measurable quantities.

Halstead established a theory based on these empirical findings and
extended it into various metrics for measuring the characteristics of
software in his book [Halstead 77]. This landmark work is well known
as software science.

The Models Of Software Length Measurement

The estimated length suggested in software science is simply a function
of the number of unique operators ($\eta_1$) and operands ($\eta_2$):

$\hat{N} = \eta_1 \log_2(\eta_1) + \eta_2 \log_2(\eta_2)$    (1)

Halstead's formula has been widely applied. However, it has also been seriously
questioned for its inaccuracy [Smith 80], [Lassez 81], [Shen 83], [Hamer 82]
and its ambiguity [Elshoff 78], [Lassez 81], [Shen 83], [Fitsos 80]. One
limitation of the length formula (1) as a tool for estimating program length is
that $\eta_1$ can only be evaluated after the program has been written. Fitsos
proposed a model that depends only on the operand vocabulary size [Fitsos 80],

$\hat{N} = c + \eta_2 \log_2(\eta_2)$    (2)

where c is a language-dependent constant. His suggestion was based on the
observation of 490 PL/S modules and Elshoff's data for 34 PL/I modules
[Elshoff 78]. Fitsos concludes that the number of distinct operators ($\eta_1$) for
programs written in a higher-level language tends to be a constant. In other
words, the program length can be determined as a function of the number of
distinct operands ($\eta_2$). This hypothesis was later reaffirmed by Christensen
[Chris 81]. Fitsos's methodology removes the restriction of Formula (1), since
the estimating process can be conducted from the variable declaration section of
the program. Formula (2) was extended by Albrecht [Albrecht 83], who reports
data for 14 modules and suggests an alternative model of the form

$\hat{N} = c \cdot \eta_2 \log_2(\eta_2)$    (3)

Formulas (2) and (3) have been investigated by Levitin [Levitin 85]. The
results of Levitin's experiments indicated that estimator (3) is superior
to estimator (2). At about the same time, [Jensen 85] studied software
measures for real-time programs and proposed the length estimation equation

$\hat{N} = \log_2(\eta_1!) + \log_2(\eta_2!)$    (4)

Jensen found that the estimates of Formula (4) were more precise
than those of Halstead's formula on his data set.
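
The four estimators (1) to (4) can be written down directly. The following is a
minimal Python sketch, not code from the report: the function names are
hypothetical, the constant c must be supplied by the caller, and the factorials
in Formula (4) are evaluated through the log-gamma function (an implementation
choice made here to avoid very large integers).

```python
import math


def halstead_length(eta1, eta2):
    """Formula (1): eta1*log2(eta1) + eta2*log2(eta2)."""
    return eta1 * math.log2(eta1) + eta2 * math.log2(eta2)


def fitsos_length(eta2, c):
    """Formula (2): additive language-dependent constant c."""
    return c + eta2 * math.log2(eta2)


def albrecht_length(eta2, c):
    """Formula (3): multiplicative constant c."""
    return c * eta2 * math.log2(eta2)


def jensen_length(eta1, eta2):
    """Formula (4): log2(eta1!) + log2(eta2!).

    lgamma(n + 1) equals ln(n!), so dividing by ln(2) gives log2(n!).
    """
    return (math.lgamma(eta1 + 1) + math.lgamma(eta2 + 1)) / math.log(2)


# Example with 16 distinct operators and 32 distinct operands:
# halstead_length(16, 32) is 16*4 + 32*5 = 224.
```

Only Formulas (2) and (3) can be evaluated before the operators are known,
which is the practical motivation the text gives for them.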

The  Nature  of  The  Problem 

Models described in the previous section are based upon various assumptions.
For example, Halstead divides a program of length N into $N/\eta$ substrings
of length $\eta$ (which is the sum of the numbers of distinct operators and
operands), and assumes that there are no duplications of these substrings. He
also assumes that operators and operands alternate in the program [Halstead 77].
Models (2) and (3) assume that the number of operators is a constant, and they
also employ one portion of Halstead's formula, $\eta_2 \log_2(\eta_2)$, to determine
the value for the operands. Jensen [Jensen 85] did not mention any assumptions
or the derivation of equation (4).

The above assumptions are not always true in real programming environments.
For instance, operators and operands do not necessarily alternate.
The statement "for(;;) {" is allowed in the language C. It contains four
operators occurring consecutively, namely "for", "(", ";", and "{". Regarding
formulae (2) and (3), the operators behave as a constant for large programs.
Fitsos [Fitsos 80] and Albrecht [Albrecht 83] both agree that the term
$\eta_2 \log_2(\eta_2)$ can determine the total value for operands in the program.
However, on observing the data sets used in their research, the author found
that the two terms in Halstead's formula cannot be used to estimate $N_1$ and
$N_2$ respectively; that is, $\eta_1 \log_2(\eta_1)$ was not a good estimator of $N_1$
and $\eta_2 \log_2(\eta_2)$ was not a good estimator of $N_2$.

The  Aims  of  The  Study 

In this report, the models are developed from the data sets, without
introducing unnatural assumptions. Three different data sets (UNIX
source code, C programs written in the course CMPSC 541, and Pascal programs)
are used to investigate the estimation models. A correlation analysis
between the estimated and the actual length is presented. Additionally, the
relative error is used for comparing the accuracy of the estimates.

SOFTWARE SCIENCE AND LENGTH ESTIMATION


The ever-increasing cost of program development has made the measurement
of software characteristics more important than ever before.
Software science includes some of the most often used measures. The metrics
proposed by Halstead's software science are briefly discussed in this chapter,
and several articles relating to software science length estimation are reviewed.

The  theory  of  software  science 

Halstead's software science is widely recognized as an important analytical
tool for the analysis and design of software. Halstead argues that algorithms
or programs have measurable characteristics analogous to the characteristics,
such as mass, that are used in physical laws. He also suggests that a
set of useful measures of program characteristics can be derived from a count
of the number and the frequency of distinct operators and operands in an
algorithm or a program. The basic counts of software science are:

$\eta_1$ = number of distinct operators
$\eta_2$ = number of distinct operands
$N_1$ = number of operator occurrences
$N_2$ = number of operand occurrences

The following are the program property measurements proposed by Halstead
in software science:

Program length

Program length N is defined as the sum of the total number of operators $N_1$
and the total number of operands $N_2$ (i.e., $N = N_1 + N_2$). The value of N can be
approximated by an estimator $\hat{N}$ that is defined as:

$\hat{N} = \eta_1 \log_2(\eta_1) + \eta_2 \log_2(\eta_2)$.

Program  volume 

A program volume metric V is defined as

$V = N \log_2(\eta)$,

where $\eta = \eta_1 + \eta_2$. Volume, in another sense, is the size of an
implementation, which can be thought of as the number of bits necessary to
express an algorithm.

Potential  volume 

The potential volume V* is the minimum possible volume for the given
algorithm. V* is of the form

$V^* = (2 + \eta_2^*) \log_2(2 + \eta_2^*)$,

where $\eta_2^*$ is the observed number of input operands required by the program.

Program  level  (difficulty) 

Any given algorithm with volume V is considered to be implemented at the
program level L, which is defined as

$L = V^* / V$,

and the inverse of the program level is termed the difficulty. That is,

$D = 1 / L$.


Program  effort 

Program development requires more effort as the size of the program
increases, but less effort when the level of the language is higher. The
effort E is then derived as:

$E = V / L$


The  unit  of  measurement  of  E  is  "elementary  mental  discriminations". 

Programming  time 

The programming time T is proportional to the effort E in developing a
program. T is defined as

$T = E / S$

for some constant S. The constant S represents the speed of programming, in
other words, the number of mental discriminations per second of which a
programmer is capable. An S value of 18 is normally used. This
number is based on the work of Stroud.
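
The measures above chain together from the four basic counts. The sketch below
is illustrative only: it assumes the standard software science definitions
$V = N \log_2(\eta)$, $V^* = (2 + \eta_2^*) \log_2(2 + \eta_2^*)$, $L = V^*/V$,
$D = 1/L$, $E = V/L$ and $T = E/S$, and the function name and dictionary keys
are hypothetical.

```python
import math


def software_science(eta1, eta2, n1, n2, eta2_star, s=18.0):
    """Derive the software science measures from the basic counts.

    eta1, eta2 : distinct operators / operands
    n1, n2     : operator / operand occurrences
    eta2_star  : number of input operands used for the potential volume
    s          : discriminations per second (18 is the usual value)
    """
    eta = eta1 + eta2                      # vocabulary size
    length = n1 + n2                       # N = N1 + N2
    volume = length * math.log2(eta)       # V
    potential = (2 + eta2_star) * math.log2(2 + eta2_star)  # V*
    level = potential / volume             # L
    difficulty = 1.0 / level               # D
    effort = volume / level                # E, in mental discriminations
    seconds = effort / s                   # T
    return {
        "N": length, "V": volume, "V*": potential,
        "L": level, "D": difficulty, "E": effort, "T": seconds,
    }


# Small made-up example: 8 operators, 8 operands, 20 + 12 occurrences,
# 2 input operands.
metrics = software_science(8, 8, 20, 12, 2)
```

With these made-up counts the volume is 32 * log2(16) = 128 and the potential
volume is 4 * log2(4) = 8, so the level is 0.0625 and the effort is 2048
discriminations.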

Software science has been accepted and discussed by many authors. Many
valuable and important articles concerned with software science are
listed and annotated in the bibliography [Leslie 87].


Program  Length  Estimation 

In software science, the length of a program is a function of the number of
unique operators and operands. This hypothesis has received the most attention
since it can be easily tested. Halstead assumed that the accuracy of the
formula depends on the "purity" of the algorithm implementations. The types
of "impurity" can be classified, according to [Halstead 77], as

•  a  complementary  operation, 

•  ambiguous  operands, 

•  synonymous  operands, 

•  common  subexpression, 

•  unwarranted  assignment,  and 

•  unfactored  expressions. 

[Elshoff 78] measured 154 programs and confirmed this hypothesis. He also
pointed out that "if $N = \hat{N}$ only for pure or well programmed algorithms,
then a simple check for pure or well programmed programs is available."

The operators and operands can be viewed as analogous to the parts of an
everyday conversational sentence: operators are the verbs, and operands are
the subjects or objects. However, in some programming languages, the
classification of a token as operator or operand becomes very ambiguous. Most
of the supporting experiments presented in [Halstead 77] were derived from the
collected algorithms of the ACM and from very small programs in ALGOL and
FORTRAN. In both languages, it is not difficult to classify a token as an
operator or an operand; in other languages, the classification can sometimes
be ambiguous. Nevertheless, from the aspect of length estimation, Shen
[Shen 83] pointed out that the misclassification of any token has virtually no
effect on the final estimate, since

$\hat{N} = \eta_1 \log_2 \eta_1 + \eta_2 \log_2 \eta_2 \approx \eta \log_2(\eta / 2)$
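
Shen's point can be checked numerically: for a fixed vocabulary size $\eta$,
moving tokens between the operator and operand classes changes
$\eta_1 \log_2 \eta_1 + \eta_2 \log_2 \eta_2$ only slightly relative to
$\eta \log_2(\eta/2)$. The sketch below is an illustration with made-up
vocabulary sizes, not data from any of the cited studies, and the function
names are hypothetical.

```python
import math


def halstead_length(eta1, eta2):
    """eta1*log2(eta1) + eta2*log2(eta2)."""
    return eta1 * math.log2(eta1) + eta2 * math.log2(eta2)


def balanced_approximation(eta):
    """eta*log2(eta/2): the estimate when eta1 == eta2 == eta/2."""
    return eta * math.log2(eta / 2)


def relative_gap(eta1, eta2):
    """Relative difference between the estimate and the approximation."""
    approx = balanced_approximation(eta1 + eta2)
    return (halstead_length(eta1, eta2) - approx) / approx


# Fixed vocabulary of 100 tokens, split in different ways.
for eta1 in (30, 40, 50, 60, 70):
    print(eta1, 100 - eta1, round(relative_gap(eta1, 100 - eta1), 4))
```

The gap is zero for an even split and stays within a few percent for the
30/70 split, which is consistent with the claim that misclassifying a
handful of tokens barely moves the length estimate.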
However, for characteristics other than length estimation, Lassez argued
that software science is not applicable because of unclear definitions of
operator, operand, and input/output parameter [Lassez 81].

The tokens in the declaration sections are not counted as part of the length
of the program [Halstead 77]. This causes obvious variation in the estimates,
since the variable declaration sections in some languages (e.g., the data
division in Cobol) represent a significant portion of the programming effort.
Therefore, many authors suggest that all software science analysers should
count operators and operands in declaration sections as well as in procedure
sections [Shen 79], [Fitsoss 79], [Elshoff 78], [Lassez 81].

Experiments have been conducted by Halstead and others to validate the
length estimation. Tests have been conducted on FORTRAN programs
[Halstead 77], [Basili 83]; Cobol programs [Bulut 74], [Zweben 79], [Shen 79a];
PL/I programs [Elshoff 76], [Smith 80]; Pascal programs [Feuer 79], [Fitz 78],
[Lassez 79]; APL modules [Zweben 79]; IBM370 Assembly programs [Smith
80]; and C programs [Crawford 85], all observing high correlation between
predicted and observed length.

However, some found that Halstead's estimated length tends to be low for
large programs and high for small programs [Smith 80], [Fitsos 80], [Shen
79a]. Shen asserted that Halstead's length estimation appears to work best
for programs with lengths in the range between 2000 and 4000 [Shen 83]. Feuer
also reported that the length equation overestimates the actual length 80% of
the time for 197 PL/I programs; in his experiment most of the programs are
shorter than 2000 [Feuer 79]. Therefore, Shen has suggested that the relative
error of the Halstead length equation can be minimized by dividing a large
program into modules of reasonable size and then summing the individual
estimates [Shen 83].

Shooman, in 1977, used a set of psychometric relationships suggested by
Zipf to estimate program length from the number of unique operators and
operands of a program [Shooman 83]. Shooman views the program as a string
of tokens. The token string that represents the program is generated by
choosing an operator from the operator set at random, then choosing an operand
from the operand set at random, and continuing this alternation. The
process halts when the last operator or operand is chosen for the first time.
Based on these assumptions, he derives a series equation to estimate the length
of a program. Mohanty [Mohanty 79] has also demonstrated that close agreement
exists between the software science results and the results obtained by the
application of Zipf's law. However, Shooman's work has also been criticized
extensively [Moranda 85] on the grounds of meaningless substitutions, equating
different probability constants, alteration of the source data set, and
violation of Zipf's law.

EXPERIMENT DESCRIPTION AND ALTERNATIVE MODELS


The primary methodology applied in the study is based on empirical
observation of program files. Three data sets in two languages, C and Pascal,
are used to investigate the length estimation models. In order to view the
behavioral trend as actual length increases or decreases, the data sets are
also divided into various partitions. Four new models are introduced. Two of
them are built upon transformed data so that the models become
more appropriate for linear modeling procedures. The other two models
have the same form, but with the error terms handled in a
different way. The relative error is used for comparing the
models. Eight models are analyzed and compared using all
of the data or the partitioned data sets.

Counting  Rules 

In the current research, modules that extract the operators and operands
from the programs were implemented (See Appendix D). The rules that
distinguish operator tokens from operand tokens are as follows:

Operators -  keywords;
             operator symbols (a pair of symbols is counted as one);
             function names;
             procedure names.

Operands -   variable names;
             numerical constants;
             quoted strings.

Comments are counted neither as operators nor as operands.
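
The counting rules can be illustrated with a toy classifier for a C-like
fragment. This is only a sketch and not the Appendix D modules: the keyword
list is a small assumed subset, a name is treated as a function name when it
is immediately followed by "(", and a paired symbol such as "( )" is counted
once by counting only its opening half. All names below are hypothetical.

```python
import re
from collections import Counter

# Assumed, deliberately incomplete keyword subset for illustration.
KEYWORDS = {"for", "if", "else", "while", "return", "int", "char"}
# Closing halves of paired symbols; the pair is counted via its opener.
CLOSERS = {")", "]", "}"}

TOKEN_RE = re.compile(r'''
    (?P<comment>/\*.*?\*/)
  | (?P<string>"(?:\\.|[^"\\])*")
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<name>[A-Za-z_]\w*)
  | (?P<symbol>==|!=|<=|>=|&&|\|\||\+\+|--|[-+*/%=<>!&|;,(){}\[\]])
''', re.VERBOSE | re.DOTALL)


def count_tokens(source):
    """Return (operators, operands) as Counters of token occurrences."""
    operators, operands = Counter(), Counter()
    # Drop comments first so they are neither operators nor operands.
    tokens = [m for m in TOKEN_RE.finditer(source)
              if m.lastgroup != "comment"]
    for i, m in enumerate(tokens):
        kind, text = m.lastgroup, m.group()
        if kind in ("string", "number"):
            operands[text] += 1
        elif kind == "name":
            following = tokens[i + 1].group() if i + 1 < len(tokens) else ""
            if text in KEYWORDS or following == "(":
                operators[text] += 1   # keyword or function name
            else:
                operands[text] += 1    # variable name
        elif text not in CLOSERS:
            operators[text] += 1       # operator symbol
    return operators, operands


ops, opnds = count_tokens('x = f(y, 2); /* note */ for(;;) { x = x + 1; }')
eta1, eta2 = len(ops), len(opnds)                       # distinct counts
n1, n2 = sum(ops.values()), sum(opnds.values())         # occurrences
```

For the fragment shown, the distinct counts and occurrence counts produced
here are the four basic inputs of every model discussed in this chapter.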

Alternative  Models 

Each of the collected data sets has many cases with length less than 500
(See Figure 1). If the data sets were investigated directly without any
transformation, the model could produce results that strongly favor programs
of large size in terms of relative error. Therefore, a logarithmic
transformation is applied in order to avoid this situation (See Figure 2). The
selection of the logarithmic transformation is quite subjective; other
transformation procedures are also worth studying in the future.

Based on initial analysis efforts, the combinations of $\eta_1$ and $\eta_2$ to be
used as independent variables are $(\eta_1 + \eta_2)$ and $(\eta_1 \cdot \eta_2)$. The
logarithms of these two combinations are suggested because, in the preliminary
analysis, they correlated more highly with the logarithm of N than any others.
In the data displayed in the figures, the distribution over the domain of the
variable is asymmetric with positive skew (i.e., a long tail to the right). The
transformation makes the high variability for large programs more homogeneous
with that of small programs.

Figure 3

Figure 4

Figures 5 and 6 present the distribution of the data points after the raw data
were transformed. The simple linear regression modeling technique is then
applied to construct the estimating models. The value of ln(N) is estimated
from the value of $\ln(\eta_1 + \eta_2)$ or $\ln(\eta_1 \cdot \eta_2)$, in the forms

$\ln(N) = \beta_0 + \beta_1 \ln(\eta_1 + \eta_2) + \epsilon$,    (5.1)

and

$\ln(N) = \beta_0 + \beta_1 \ln(\eta_1 \cdot \eta_2) + \epsilon$.    (5.2)

The estimated N value can be obtained from the above equations by applying
the inverse transformation. The models are then expressed by the equations

$N = \exp(\beta_0 + \beta_1 \ln(\eta_1 + \eta_2)) \cdot \epsilon^*$    (6.1)

and

$N = \exp(\beta_0 + \beta_1 \ln(\eta_1 \cdot \eta_2)) \cdot \epsilon^*$    (6.2)

where $\epsilon^* = \exp(\epsilon)$. Equations (6.1) and (6.2) are then simplified as

$N = \gamma (\eta_1 + \eta_2)^{\beta} \cdot \epsilon^*$    (7.1)

and

$N = \gamma (\eta_1 \cdot \eta_2)^{\beta} \cdot \epsilon^*$    (7.2)

where $\gamma = \exp(\beta_0)$ and $\beta = \beta_1$.
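
The fitting step behind models (7.1) and (7.2) is ordinary least squares on
the log-transformed data followed by the inverse transformation. The sketch
below uses plain Python and synthetic data; the function names are
hypothetical, and x stands for either $\eta_1 + \eta_2$ or $\eta_1 \cdot \eta_2$.

```python
import math


def fit_log_linear(xs, lengths):
    """Regress ln(N) on ln(x) by least squares; return (beta0, beta1)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(n) for n in lengths]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


def predict_length(x, beta0, beta1):
    """Inverse transformation: gamma * x**beta with gamma = exp(beta0)."""
    return math.exp(beta0) * x ** beta1


# Synthetic, noise-free data generated from N = 2 * x**1.5.
xs = [10, 20, 40, 80]
lengths = [2 * x ** 1.5 for x in xs]
beta0, beta1 = fit_log_linear(xs, lengths)
```

Because the error is multiplicative on the original scale, this fit weights
relative deviations rather than absolute ones, which is the property the
transformation was chosen for. As a hedged illustration, if the first L+
parameter reported in Table 2.1 (-0.797189) is read as the intercept
$\beta_0$ (an assumption about how the table is labelled),
predict_length(100, -0.797189, 1.523624) comes out at roughly 500 tokens.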
[Figure: Ln(N) plotted against Ln($\eta_1 \cdot \eta_2$)]

The error terms ($\epsilon^*$) in the two models (7.1) and (7.2) are
multiplicative rather than additive. Therefore, models that handle the error
term in an additive fashion are also considered, in the forms

$N = \alpha (\eta_1 + \eta_2)^{\beta} + \epsilon^+$    (8.1)

and

$N = \alpha (\eta_1 \cdot \eta_2)^{\beta} + \epsilon^+$    (8.2)

The procedure for deriving the parameters in (8.1) and (8.2) is not the
same as in (7.1) and (7.2). The parameters in (7.1) and (7.2) are obtained
by running simple linear regression on the transformed data. In the
latter models, (8.1) and (8.2), the parameters are obtained by
nonlinear least squares. The method used is the modified Gauss-Newton
method, and the data are processed by the procedure NLIN in the SAS
package [SAS-STAT 82].
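
For readers without access to SAS, the idea of fitting $N = \alpha x^{\beta}$
with an additive error can be sketched as a generic Gauss-Newton iteration
with step halving. This is not a reproduction of the NLIN procedure: the
function names, starting values, iteration limit and tolerances are all
assumptions made for illustration, and x again stands for
$\eta_1 + \eta_2$ or $\eta_1 \cdot \eta_2$.

```python
import math


def sse(xs, lengths, alpha, beta):
    """Sum of squared residuals of N = alpha * x**beta."""
    return sum((n - alpha * x ** beta) ** 2 for x, n in zip(xs, lengths))


def fit_power_additive(xs, lengths, alpha, beta, iterations=50, tol=1e-10):
    """Gauss-Newton with step halving, started from (alpha, beta)."""
    for _ in range(iterations):
        a11 = a12 = a22 = g1 = g2 = 0.0
        for x, n in zip(xs, lengths):
            p = x ** beta
            j1 = p                          # d(model)/d(alpha)
            j2 = alpha * p * math.log(x)    # d(model)/d(beta)
            r = n - alpha * p               # residual
            a11 += j1 * j1
            a12 += j1 * j2
            a22 += j2 * j2
            g1 += j1 * r
            g2 += j2 * r
        det = a11 * a22 - a12 * a12
        if det == 0.0:
            break
        # Solve the 2x2 normal equations (J'J) d = J'r.
        d_alpha = (a22 * g1 - a12 * g2) / det
        d_beta = (a11 * g2 - a12 * g1) / det
        current = sse(xs, lengths, alpha, beta)
        step = 1.0
        # Halve the step until the sum of squares actually decreases.
        while step > 1e-8 and sse(xs, lengths, alpha + step * d_alpha,
                                  beta + step * d_beta) >= current:
            step /= 2
        if step <= 1e-8:
            break
        alpha += step * d_alpha
        beta += step * d_beta
        if abs(step * d_alpha) + abs(step * d_beta) < tol:
            break
    return alpha, beta


# Synthetic, noise-free data generated from N = 2 * x**1.3.
xs = [10, 20, 40, 80, 160]
lengths = [2 * x ** 1.3 for x in xs]
alpha_hat, beta_hat = fit_power_additive(xs, lengths, alpha=1.8, beta=1.35)
```

Unlike the log-linear fit, this criterion minimizes absolute squared error,
so large programs dominate the fit; a reasonable starting point (for example
the log-linear estimates) is assumed.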

For convenience, each model will be referred to by a particular
symbol in the rest of this report (See Table 1).

The  Description  of  Experiment 

The experiment proceeds from data collection and segmentation
to parameter development, length estimation, correlation of
actual length versus estimated length, and relative error analysis. The
results of the experiment are discussed in the following chapter.

Table 1

Symbol   Model

H        $\hat{N} = \eta_1 \log_2(\eta_1) + \eta_2 \log_2(\eta_2)$
F+       $\hat{N} = c + \eta_2 \log_2(\eta_2)$
A*       $\hat{N} = c \cdot \eta_2 \log_2(\eta_2)$
J        $\hat{N} = \log_2(\eta_1!) + \log_2(\eta_2!)$
L+       $\hat{N} = \gamma (\eta_1 + \eta_2)^{\beta}$
L*       $\hat{N} = \gamma (\eta_1 \cdot \eta_2)^{\beta}$
NL+      $\hat{N} = \alpha (\eta_1 + \eta_2)^{\beta}$
NL*      $\hat{N} = \alpha (\eta_1 \cdot \eta_2)^{\beta}$

Note: L+ and L* are the models derived from simple linear regression on the
logarithmically transformed data.

NL+ and NL* are the models derived from nonlinear regression by least
squares.
1) Data collection and partition:

There are three data sets used to investigate the estimation models:

•  799 UNIX system source codes (See Appendix A),

•  99 CMPSC 541 project programs written in C (See Appendix B), and

•  404 Pascal programs acquired from Mata-Toledo's dissertation
[Mata 84] (See Appendix C).

The combined data set that includes all three data sets is also examined.

In order to see the behavior of the errors of each model for various
program lengths, the data are also partitioned by the size of the actual
length into five ranges, which are examined in addition to the total:

i.   Total (includes all the observations).
ii.   Actual length is less than or equal to 500.
iii.   Actual length is between 501 and 1000.
iv.   Actual length is between 1001 and 2000.
v.   Actual length is between 2001 and 4000.
vi.   Actual length is more than 4000.

2) Parameter Estimation:

According to the models described in this section, the parameters were
estimated for all the models. The procedures for deriving the parameters of
the models L+, L*, NL+, and NL* are discussed in the previous section.
No parameter is needed in the models H and J. The constant c
in the model F+ is obtained by averaging the values of c
calculated for the individual observations. In the model A*, the constant c
is estimated by fitting a linear model without an intercept. Note
that the parameters were estimated from the complete data sets,
not from the partitioned ones.

3) Estimated Length Acquisition and Relative Error Calculation:

The parameters were used in the models to calculate the estimated length
for each observation. After all the estimated lengths were obtained,
the relative error of each observation was calculated by the equation:

Relative Error = | estimated length - actual length | / actual length

4)  Correlation  Coefficients  Comparisons: 

The  correlation  coefficient  is  a  measure  of  the  linear  relationship  between 
two  variables.  These  coefficients  are  calculated  in  order  to  examine  the 
linear  relationship  between  the  estimated  length  and  actual  length  for  all 
the  models  in  various  data  sets. 

5) Mean of Relative Error:

The correlation coefficient estimates the degree of closeness of the linear
relationship between two variables. For these variables (estimated and
actual length), the relative error is also important, because the correlation
coefficient does not provide information about how close the values of the
two variables are. Therefore, the relative errors are also used. The mean of
the relative error is computed for each combination of model and partitioned
data set. These values represent the accuracy of the estimation of each
model. The numbers of over-estimated and under-estimated observations are
also determined by comparing the estimated length and the actual length.
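
Steps 3) to 5) can be summarized in a few lines of code. The following is a
sketch under stated assumptions rather than the analysis programs used in the
study: the function names are hypothetical, the partition boundaries follow
the five ranges listed in step 1), and the sample pairs are made up.

```python
import math

# (low, high) bounds on actual length, inclusive, for the five ranges.
PARTITIONS = [(0, 500), (501, 1000), (1001, 2000), (2001, 4000),
              (4001, math.inf)]


def relative_error(estimated, actual):
    """|estimated - actual| / actual."""
    return abs(estimated - actual) / actual


def summarize(pairs):
    """pairs: list of (estimated, actual). Return (count, MRE, over, under)."""
    count = len(pairs)
    mre = (sum(relative_error(e, a) for e, a in pairs) / count
           if count else float("nan"))
    over = sum(1 for e, a in pairs if e > a)
    under = sum(1 for e, a in pairs if e < a)
    return count, mre, over, under


def summarize_by_partition(pairs):
    """Apply summarize() to each actual-length range."""
    return {
        (low, high): summarize([(e, a) for e, a in pairs if low <= a <= high])
        for low, high in PARTITIONS
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up (estimated, actual) pairs.
pairs = [(110, 100), (90, 100), (600, 800), (5000, 4500)]
total = summarize(pairs)
by_range = summarize_by_partition(pairs)
```

An estimator that always returns twice the actual length has a correlation
of 1 with it and a relative error of 1, which is why both measures are
reported.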
RESULTS AND DISCUSSIONS

The results of the experiment described in the previous chapter are
presented in tabular form. Tables 2.1, 2.2, 2.3 and 2.4 present the parameters
developed from the various data sets: Total, UNIX, Pascal and
CMPSC 541. Table 3 shows the correlation coefficients between the estimated
length and actual length for each combination of model and data set.
The means of relative error are presented in Table 4. For the various
ranges of actual length, the rest of the tables indicate the mean of relative
error and the counts of over- and under-estimates for each range of actual
program length. These tables are named in the form X-Y, where X
represents the name of the data set, and Y the model name. The accuracy of
a model is judged by a small MRE and a balance between the counts of
overestimating and underestimating the actual length.

Correlation  Coefficients  of  Actual  length  vs.  Estimated  Length 

On observing Table 3, it can be seen that all the estimated values are
highly correlated with the actual length in the various data sets. The
correlation coefficients in most cases are higher than 0.9 (except for the
data set of CMPSC 541). Roughly speaking, the models proposed in this report
have coefficient values a little higher than the others (except NL* in the
Total set, L* in the Pascal set, and L+ in the CMPSC 541 set). High correlation
means that the two variables are likely to have a linear relationship. But high
correlation does not imply that $\hat{N}$ is equal or close to N. In order to
examine the estimation models in more detail, the relative error is employed,

Table 2.1
Parameters Estimation (Total observations)

(Number of Observations = 1302)

Model   Parameter               S.E.

H       None
F+      c = 526.3166            30.2102
A*      c = 1.52496             0.017142
J       None
L+      $\gamma$ = -0.797189    0.046869
        $\beta$ = 1.523624      0.010185
L*      $\gamma$ = 0.060337     0.039726
        $\beta$ = 0.801774      0.005156
NL+     $\alpha$ = 2.111956     0.189451
        $\beta$ = 1.261011      0.014111
NL*     $\alpha$ = 1.085501     0.165480
        $\beta$ = 0.808215      0.014440

Note: S.E. is the standard error of the estimate of the corresponding parameter.
Table 2.2
Parameters Estimation (UNIX source programs)

(Number of Observations = 799)

Estimator   Parameters              S.E.

H           None
F+          c = 710.9088            38.4683
A*          c = 1.876848            0.023487
J           None
L+          $\gamma$ = -0.938968    0.063487
            $\beta$ = 1.557691      0.013077
L*          $\gamma$ = 0.169702     0.052626
            $\beta$ = 0.785363      0.006832
NL+         $\alpha$ = 1.172288     0.150518
            $\beta$ = 1.368080      0.021244
NL*         $\alpha$ = 1.5572644    0.185082
            $\beta$ = 0.760995      0.011297

Note: S.E. is the standard error of the estimate of the corresponding parameter.
Table 2.3
Parameters Estimation (Pascal programs)

(Number of Observations = 404)

Model   Parameter               S.E.

H       None
F+      c = 174.2780            52.0660
A*      c = 1.28740             0.020870
J       None
L+      $\gamma$ = -0.541532    0.060009
        $\beta$ = 1.426176      0.014427
L*      $\gamma$ = -0.217718    0.060525
        $\beta$ = 0.838510      0.009033
NL+     $\alpha$ = 1.1170146    0.248648
        $\beta$ = 1.344902      0.032790
NL*     $\alpha$ = 0.062634     0.018846
        $\beta$ = 1.120711      0.028079

Note: S.E. is the standard error of the estimate of the corresponding parameter.
Table 2.4
Parameters Estimation (CMPSC 541 programs)

(Number of Observations = 99)

Model   Parameter               S.E.

H       None
F+      c = 473.1435            92.7675
A*      c = 2.490146            0.135502
J       None
L+      $\gamma$ = -1.051215    0.333824
        $\beta$ = 1.677929      0.079486
L*      $\gamma$ = 0.034634     0.287546
        $\beta$ = 0.856614      0.041240
NL+     $\alpha$ = 2.7808183    1.066685
        $\beta$ = 1.287783      0.068851
NL*     $\alpha$ = 1.808370     0.713664
        $\beta$ = 0.811828      0.041813

Note: S.E. is the standard error of the estimate of the corresponding parameter.
Table 3
Correlation Coefficients of $\hat{N}$ vs. N

Model   Total     UNIX      Pascal    CMPSC 541
        (1302)    (799)     (404)     (99)

H       0.92994   0.92880   0.94333   0.86263
F+      0.90285   0.90523   0.94455   0.83653
A*      0.90285   0.90523   0.94455   0.83653
J       0.92916   0.92892   0.94432   0.85968
L+      0.91734   0.92728   0.94341   0.82119
L*      0.88873   0.94112   0.93455   0.88478
NL+     0.93191   0.93225   0.94466   0.85978
NL*     0.88872   0.94129   0.94928   0.88684

and discussed in the following sections.

Mean  Of  Relative  Error  (MRE)  Analysis: 

The values of the mean of relative errors are listed in Table 4; the values
are obtained over the full range of each data set. The value in
parentheses in each cell gives the rank of the model by
MRE for the particular data set. For a more detailed investigation of the
behavior of the MRE, each data set was partitioned into five ranges according
to the actual length of the programs. The MRE and the counts of over- and
under-estimates are given in the tables on pages 36 to 43. These
tables are arranged according to the combination of the model and the data set
observed. In each table, the first column shows the range of the actual
length, the second column gives the number of observations, and the MRE
and the counts of over- and under-estimated observations are listed in
columns 3, 4 and 5 respectively.

Table 4 shows that the models L+ and L* have smaller MRE than most
other models. Model F+ has the largest MRE of the eight models listed. If we
sum the ranks (values in the parentheses) for each model, then an overall
ranking of the models can be drawn from this sum. Models L+ and L* are
ranked first (6), followed by J (16), NL* (18), A* (18), H (22), NL+
(26), and F+ (32). These results should not be taken as final; the
behavior of the MRE in different ranges of actual length and the counts of
overestimates and underestimates also need to be investigated.
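
The rank sums quoted above can be recomputed from the ranks printed in
parentheses in Table 4. The short sketch below simply transcribes those ranks
(in the order Total, UNIX, Pascal, CMPSC 541) and adds them up; the variable
names are hypothetical.

```python
# Ranks from Table 4, in the order Total, UNIX, Pascal, CMPSC 541.
RANKS = {
    "H":   (6, 6, 7, 3),
    "F+":  (8, 8, 8, 8),
    "A*":  (5, 5, 4, 4),
    "J":   (4, 4, 3, 5),
    "L+":  (2, 2, 1, 1),
    "L*":  (1, 1, 2, 2),
    "NL+": (7, 7, 5, 7),
    "NL*": (3, 3, 6, 6),
}

rank_sums = {model: sum(ranks) for model, ranks in RANKS.items()}

# Smaller sums indicate a better overall rank; ties keep table order.
ordering = sorted(rank_sums.items(), key=lambda item: item[1])
for model, total in ordering:
    print(model, total)
```

A rank sum treats the four data sets as equally important regardless of their
sizes (1302, 799, 404 and 99 observations), which is one reason the text
cautions against reading the ordering as final.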
Table 4

Mean Relative Error Comparisons

Model    Total          UNIX           Pascal         CMPSC 541
H        0.43321 (6)    0.36637 (6)    0.58857 (7)    0.33869 (3)
F+       3.11230 (8)    2.69839 (8)    1.79549 (8)    1.68634 (8)
A*       0.36258 (5)    0.32643 (5)    0.35115 (4)    0.35556 (4)
J        0.29666 (4)    0.30660 (4)    0.26254 (3)    0.35571 (5)
L+       0.27382 (2)    0.24802 (2)    0.24323 (1)    0.32592 (1)
L*       0.25430 (1)    0.23204 (1)    0.25147 (2)    0.32858 (2)
NL+      0.66637 (7)    0.40005 (7)    0.50958 (5)    0.85043 (7)
NL*      0.28892 (3)    0.27106 (3)    0.57086 (6)    0.53149 (6)

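
The rank sums quoted in the text can be re-derived from the parenthesized ranks in Table 4. The following is a small sketch for checking that arithmetic; the dictionary layout and variable names are chosen for the example only, with the four ranks per model given in the order Total, UNIX, Pascal, CMPSC 541.

```python
# Ranks copied from Table 4, in the order Total, UNIX, Pascal, CMPSC 541.
RANKS = {
    "H":   (6, 6, 7, 3),
    "F+":  (8, 8, 8, 8),
    "A*":  (5, 5, 4, 4),
    "J":   (4, 4, 3, 5),
    "L+":  (2, 2, 1, 1),
    "L*":  (1, 1, 2, 2),
    "NL+": (7, 7, 5, 7),
    "NL*": (3, 3, 6, 6),
}

# Sum the four ranks of each model; a smaller sum means a better overall rank.
rank_sums = {model: sum(ranks) for model, ranks in RANKS.items()}

# sorted() is stable, so models with equal sums keep their table order.
ordering = sorted(rank_sums, key=rank_sums.get)

for model in ordering:
    print(f"{model:4s} {rank_sums[model]}")
```

Summing ranks weights the four data sets equally even though their sizes differ greatly (1302, 799, 404 and 99 observations), which is one reason to treat the ordering as indicative rather than final.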
The characteristics of each model's estimates are discussed in the
following sections:

H  ( $\hat{N} = \eta_1 \log_2 \eta_1 + \eta_2 \log_2 \eta_2$ ):

The bias of Halstead's model was inspected (page 36): it tends to
overestimate small programs and underestimate large ones. For programs of
length less than 500, H overestimated the actual length about 90 percent of
the time; in contrast, for lengths greater than 4000, H underestimated the
actual length more than 92 percent of the time (page 36). In the C541 data
set, it always underestimated when the actual length was greater than 500,
and the MRE clearly increases as N goes up. The range of actual length
between 500 and 1000 seems most suitable for Halstead's model, with a better
balance of over- and underestimation and a smaller MRE.
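
As a hedged illustration of this bias, Halstead's estimator can be evaluated on two programs. The function name `halstead_length` is chosen for the example, and the two sample rows are taken from the token counts of the UNIX source programs on the assumption that the first two columns of that listing are the unique operator and operand counts and the last column is the actual length.

```python
from math import log2


def halstead_length(eta1, eta2):
    """Halstead's length estimate from unique operator and operand counts."""
    return eta1 * log2(eta1) + eta2 * log2(eta2)


# (eta1, eta2, actual length); column meanings are assumed, see the lead-in.
samples = [(15, 9, 58), (122, 289, 5080)]

for eta1, eta2, actual in samples:
    estimate = halstead_length(eta1, eta2)
    direction = "over" if estimate > actual else "under"
    print(f"eta1={eta1} eta2={eta2} actual={actual} estimate={estimate:.1f} ({direction})")
```

With these two rows the small program is overestimated and the large one is underestimated, which matches the direction of the trend reported above; two rows are of course not evidence on their own.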

F+  ( $\hat{N} = c + \eta_2 \log_2 \eta_2$ ):

Because of the high variation of the constant c in Fitsos's model, the
accuracy of this model has been questioned. The coefficients of variation of
c are 207.1156, 152.9544, 600.4850 and 195.0836 for the Total, UNIX, Pascal
and C541 data sets, respectively. The MRE results also show the obvious bias
of Fitsos's model: not only is the MRE high, but the model also seriously
overestimates small programs (page 37).

A*  ( $\hat{N} = c \cdot \eta_2 \cdot \log_2 \eta_2$ ):

Model A* behaves more consistently than H or F+. No obvious trend appears
in the MRE or in the imbalance of over/under estimation counts as the
program length changes. In the Total and UNIX data sets (see page 38), it
becomes more accurate as the actual length grows; this result also agrees
with Fitsos's assumption that the number of operators becomes constant as
the program length increases. In the Pascal data set, it has a very high MRE
and an imbalance of over/under estimation counts in the mid-size range of N.

J  ( $\hat{N} = \log_2(\eta_1!) + \log_2(\eta_2!)$ ):

This model was investigated by Jensen [Jensen 85], who used data with an
average length of less than 400. Within this range of length, the author
found agreement with Jensen's results: model J is quite a good model when
the actual program length is less than 500. However, when actual lengths
greater than 500 are observed, a trend in the MRE appears (page 39). This
trend shows that model J tends to underestimate the actual length as it
increases. The phenomenon appears in all four data sets. For example, in the
UNIX data set, the MRE for sizes less than 500 is 0.24601 and the percentage
of underestimating counts is 43, but for sizes greater than 4000 the MRE
becomes 0.47555 and the percentage of underestimating counts becomes 100.
Across the data sets, J underestimated the actual length almost 100 percent
of the time when N was greater than 4000.

L+  ( $\hat{N} = \Gamma (\eta_1 + \eta_2)^{\beta}$ ):

Model L+ has a very low MRE and a good balance of over/under estimating
counts in all four data sets (page 40). This is due to the model being
derived from a least squares approach. The MRE is somewhat higher in the
Pascal data set in the range of 500 to 2000 (about 0.48), but it is still
lower than that of most of the other models.
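
A least squares fit of a power-law length model can be sketched as ordinary linear regression on logarithms: if the length is a coefficient times the total vocabulary raised to an exponent, then the logarithm of the length is linear in the logarithm of the vocabulary. The code below is a minimal sketch under that assumption, not the fitting procedure actually used in the study; the function name `fit_power_law` and the use of natural logarithms are choices made for the example.

```python
from math import exp, log


def fit_power_law(samples):
    """Fit length = coeff * (eta1 + eta2) ** exponent by least squares on logs.

    samples: list of (eta1, eta2, actual_length) tuples.
    Returns (coeff, exponent).
    """
    xs = [log(eta1 + eta2) for eta1, eta2, _ in samples]
    ys = [log(length) for _, _, length in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope and intercept of the simple regression of y on x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx
    coeff = exp(mean_y - exponent * mean_x)
    return coeff, exponent


# Synthetic check: data generated exactly from coeff=2, exponent=1.5.
data = [(a, b, 2 * (a + b) ** 1.5) for a, b in [(3, 5), (10, 20), (40, 60)]]
print(fit_power_law(data))
```

Because the squared error is minimized on the logarithmic scale, the fitted curve passes through the middle of the data in relative terms, which is one plausible reason for a balanced count of over- and underestimates.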


L*  ( $\hat{N} = \Gamma (\eta_1 \cdot \eta_2)^{\beta}$ ):

From the viewpoint of accuracy, models L* and L+ are very similar (page
41). L* also has a low MRE and a good balance of over/under estimating
counts. These two models, L* and L+, show the lowest MREs of the eight
models in the current study.

NL+  ( $\hat{N} = \alpha (\eta_1 + \eta_2)^{\beta}$ ):

Model NL+ tends to overestimate when the actual length is small. It has
its highest MRE and percentage of overestimating counts for lengths smaller
than 500, and becomes more accurate as the size increases (page 42), except
for sizes greater than 4000 in the C541 data set.

NL*  ( $\hat{N} = \alpha (\eta_1 \cdot \eta_2)^{\beta}$ ):

This model has a low MRE, but for short programs the balance of over/under
estimating counts seems to be language dependent. For instance, for programs
of length less than 500, it overestimates the length more than 78 percent of
the time in the UNIX data set (page 43) and 90 percent of the time in the
C541 data set. In contrast, it underestimates the length 99 percent of the
time in the Pascal data set.

Summary 

From the above analysis, models L+ and L* are suggested for program length
estimation. Not only do they have lower MREs, but their balance of
over/under estimating counts is also good. Model H tends to overestimate
small programs and underestimate large programs. Model F+ has very high
MREs, so it is not suggested for estimation. Model A* is good when the
actual length is large. Model J has a serious bias when dealing with large
programs, since a trend in the MRE exists; however, on small programs it
provides very impressive results. Because of its simple structure and
freedom from parameters, model J is suggested when the program length is not
very large. Model NL+ has a higher MRE, and NL* shows language-dependent
results for small programs; moreover, developing the parameter values for
these two models is very time consuming, so NL+ and NL* are not recommended
for length estimation.

From the viewpoint of correlation coefficients, all eight models provide
estimated lengths that are highly correlated with the actual lengths.
Nevertheless, most of them require further refinement before they can
perform well in estimation.

Table  Total - H

Actual Length        Obs    MRE        Over   Under
N > 0               1302    0.43321     803     498
N < 501              702    0.55397     618      83
500 < N < 1001       165    0.25986      81      84
1000 < N < 2001      186    0.30685      77     109
2000 < N < 4001      148    0.26298      24     124
N > 4000             101    0.35923       3      98

Table  UNIX - H

Actual Length        Obs    MRE        Over   Under
N > 0                799    0.36637     413     385
N < 501              315    0.54309     280      34
500 < N < 1001       135    0.21209      72      63
1000 < N < 2001      149    0.22852      48     101
2000 < N < 4001      131    0.25918      12     119
N > 4000              69    0.36267       1      68

Table  Pascal - H

Actual Length        Obs    MRE        Over   Under
N > 0                404    0.58857     348      56
N < 501              318    0.62077     296      22
500 < N < 1001        14    0.61437       9       5
1000 < N < 2001       32    0.65151      29       3
2000 < N < 4001       13    0.23541      12       1
N > 4000              27    0.29136       2      25

Table  C541 - H

Actual Length        Obs    MRE        Over   Under
N > 0                 99    0.33869      42      57
N < 501               69    0.29584      42      27
500 < N < 1001        16    0.35269       0      16
1000 < N < 2001        5    0.43507       0       5
2000 < N < 4001        4    0.47689       0       4
N > 4000               5    0.67826       0       5

Table  Total - F+

Actual Length        Obs    MRE        Over   Under
N > 0               1302    3.11230     965     337
N < 501              702    5.47453     702       0
500 < N < 1001       165    0.37653     151      14
1000 < N < 2001      186    0.31914      88      98
2000 < N < 4001      148    0.31671      19     129
N > 4000             101    0.40336       5      96

Table  UNIX - F+

Actual Length        Obs    MRE        Over   Under
N > 0                799    2.69839     554     245
N < 501              315    6.28746     315       0
500 < N < 1001       135    0.60655     135       0
1000 < N < 2001      149    0.21640      91      58
2000 < N < 4001      131    0.25583      13     118
N > 4000              69    0.40324       0      69

Table  Pascal - F+

Actual Length        Obs    MRE        Over   Under
N > 0                404    1.79549     365      39
N < 501              318    2.16064     315       3
500 < N < 1001        14    0.57567       9       5
1000 < N < 2001       32    0.60740      28       4
2000 < N < 4001       13    0.19744      11       2
N > 4000              27    0.30485       2      25

Table  C541 - F+

Actual Length        Obs    MRE        Over   Under
N > 0                 99    1.68634      76      23
N < 501               69    2.29610      69       0
500 < N < 1001        16    0.12844       7       9
1000 < N < 2001        5    0.27222       0       5
2000 < N < 4001        4    0.45518       0       4
N > 4000               5    0.65602       0       5

Table  Total - A*

Actual Length        Obs    MRE        Over   Under
N > 0               1302    0.36258     454     848
N < 501              702    0.33771     292     410
500 < N < 1001       165    0.41478      44     121
1000 < N < 2001      186    0.47152      66     120
2000 < N < 4001      148    0.32260      33     115
N > 4000             101    0.30813      19      82

Table  UNIX - A*

Actual Length        Obs    MRE        Over   Under
N > 0                799    0.32643     355     444
N < 501              315    0.37425     155     160
500 < N < 1001       135    0.37337      62      73
1000 < N < 2001      149    0.32088      72      77
2000 < N < 4001      131    0.23192      51      80
N > 4000              69    0.20768      15      54

Table  Pascal - A*

Actual Length        Obs    MRE        Over   Under
N > 0                404    0.35115     187     217
N < 501              318    0.28707     129     189
500 < N < 1001        14    0.75631       9       5
1000 < N < 2001       32    0.87499      29       3
2000 < N < 4001       13    0.42121      12       1
N > 4000              27    0.24126       8      19

Table  C541 - A*

Actual Length        Obs    MRE        Over   Under
N > 0                 99    0.35556      69      30
N < 501               69    0.38227      56      13
500 < N < 1001        16    0.29103       9       7
1000 < N < 2001        5    0.17190       2       3
2000 < N < 4001        4    0.16545       1       3
N > 4000               5    0.52914       1       4

Table  Total - J

Actual Length        Obs    MRE        Over   Under
N > 0               1302    0.29666     468     834
N < 501              702    0.24204     396     306
500 < N < 1001       165    0.30509      27     138
1000 < N < 2001      186    0.34242      37     149
2000 < N < 4001      148    0.37448       7     141
N > 4000             101    0.46428       1     100

Table  UNIX - J

Actual Length        Obs    MRE        Over   Under
N > 0                799    0.30660     212     587
N < 501              315    0.24601     178     137
500 < N < 1001       135    0.26569      19     116
1000 < N < 2001      149    0.32052      13     136
2000 < N < 4001      131    0.38967       2     129
N > 4000              69    0.47555       0      69

Table  Pascal - J

Actual Length        Obs    MRE        Over   Under
N > 0                404    0.26254     245     159
N < 501              318    0.23352     207     111
500 < N < 1001        14    0.44932       8       6
1000 < N < 2001       32    0.40956      24       8
2000 < N < 4001       13    0.15619       5       8
N > 4000              27    0.38438       1      26

Table  C541 - J

Actual Length        Obs    MRE        Over   Under
N > 0                 99    0.35571      11      88
N < 501               69    0.26318      11      58
500 < N < 1001        16    0.51137       0      16
1000 < N < 2001        5    0.56548       0       5
2000 < N < 4001        4    0.58647       0       4
N > 4000               5    0.74022       0       5

Table  Total - L+

Actual Length        Obs    MRE        Over   Under
N > 0               1302    0.27382     684     618
N < 501              702    0.24372     434     268
500 < N < 1001       165    0.30436      65     100
1000 < N < 2001      186    0.38194      90      96
2000 < N < 4001      148    0.26436      60      88
N > 4000             101    0.24792      35      66

Table  UNIX - L+

Actual Length        Obs    MRE        Over   Under
N > 0                799    0.24802     424     375
N < 501              315    0.24843     201     114
500 < N < 1001       135    0.25507      57      78
1000 < N < 2001      149    0.27545      68      81
2000 < N < 4001      131    0.23769      64      67
N > 4000              69    0.19275      34      35

Table  Pascal - L+

Actual Length        Obs    MRE        Over   Under
N > 0                404    0.24323     217     187
N < 501              318    0.20628     168     150
500 < N < 1001        14    0.47496       8       6
1000 < N < 2001       32    0.48804      25       7
2000 < N < 4001       13    0.18844       9       4
N > 4000              27    0.29449       7      20

Table  C541 - L+

Actual Length        Obs    MRE        Over   Under
N > 0                 99    0.32592      61      38
N < 501               69    0.33915      48      21
500 < N < 1001        16    0.28732       8       8
1000 < N < 2001        5    0.15942       2       3
2000 < N < 4001        4    0.14576       2       2
N > 4000               5    0.57761       1       4

Table  Total - L*

Actual Length        Obs    MRE        Over   Under
N > 0               1302    0.25430     745     557
N < 501              702    0.23702     436     266
500 < N < 1001       165    0.28552      84      81
1000 < N < 2001      186    0.28418     113      73
2000 < N < 4001      148    0.24847      79      69
N > 4000             101    0.27696      33      68

Table  UNIX - L*

Actual Length        Obs    MRE        Over   Under
N > 0                799    0.23204     449     350
N < 501              315    0.24184     212     103
500 < N < 1001       135    0.23256      69      66
1000 < N < 2001      149    0.23361      80      69
2000 < N < 4001      131    0.23568      61      70
N > 4000              69    0.17600      27      42

Table  Pascal - L*

Actual Length        Obs    MRE        Over   Under
N > 0                404    0.25147     221     183
N < 501              318    0.20338     174     144
500 < N < 1001        14    0.58552       9       5
1000 < N < 2001       32    0.48848      27       5
2000 < N < 4001       13    0.19022       9       4
N > 4000              27    0.39326       2      25

Table  C541 - L*

Actual Length        Obs    MRE        Over   Under
N > 0                 99    0.32858      60      39
N < 501               69    0.35174      48      21
500 < N < 1001        16    0.29335       8       8
1000 < N < 2001        5    0.16039       2       3
2000 < N < 4001        4    0.12378       1       3
N > 4000               5    0.45380       1       4

Table  Total - NL+

Actual Length        Obs    MRE        Over   Under
N > 0               1302    0.66637    1017     285
N < 501              702    0.95911     679      23
500 < N < 1001       165    0.37928     121      44
1000 < N < 2001      186    0.40033     127      59
2000 < N < 4001      148    0.22919      71      77
N > 4000             101    0.23128      19      82

Table  UNIX - NL+

Actual Length        Obs    MRE        Over   Under
N > 0                799    0.40005     561     238
N < 501              315    0.62559     292      23
500 < N < 1001       135    0.30143      99      36
1000 < N < 2001      149    0.27761      89      60
2000 < N < 4001      131    0.21496      64      67
N > 4000              69    0.17921      17      52

Table  Pascal - NL+

Actual Length        Obs    MRE        Over   Under
N > 0                404    0.50958     344      60
N < 501              318    0.51024     286      32
500 < N < 1001        14    0.63585       9       5
1000 < N < 2001       32    0.72585      29       3
2000 < N < 4001       13    0.34342      12       1
N > 4000              27    0.26000       8      19

Table  C541 - NL+

Actual Length        Obs    MRE        Over   Under
N > 0                 99    0.85043      84      15
N < 501               69    1.08063      67       2
500 < N < 1001        16    0.35793      11       5
1000 < N < 2001        5    0.19924       3       2
2000 < N < 4001        4    0.13099       2       2
N > 4000               5    0.47635       1       4

Table  Total - NL*

Actual Length        Obs    MRE        Over   Under
N > 0               1302    0.28892     844     458
N < 501              702    0.27265     486     216
500 < N < 1001       165    0.31585      96      69
1000 < N < 2001      186    0.33230     125      61
2000 < N < 4001      148    0.28667      93      55
N > 4000             101    0.28150      44      57

Table  UNIX - NL*

Actual Length        Obs    MRE        Over   Under
N > 0                799    0.27106     512     287
N < 501              315    0.32693     247      68
500 < N < 1001       135    0.25292      82      53
1000 < N < 2001      149    0.24686      89      60
2000 < N < 4001      131    0.23482      66      65
N > 4000              69    0.17260      28      41

Table  Pascal - NL*

Actual Length        Obs    MRE        Over   Under
N > 0                404    0.57086      55     349
N < 501              318    0.59838       1     317
500 < N < 1001        14    0.59835       8       6
1000 < N < 2001       32    0.62654      26       6
2000 < N < 4001       13    0.42233      11       2
N > 4000              27    0.23807       9      18

Table  C541 - NL*

Actual Length        Obs    MRE        Over   Under
N > 0                 99    0.53149      80      19
N < 501               69    0.61816      62       7
500 < N < 1001        16    0.38630      10       6
1000 < N < 2001        5    0.20433       4       1
2000 < N < 4001        4    0.16734       3       1
N > 4000               5    0.41846       1       4



CONCLUSIONS  AND  FUTURE  WORK 

This study attempted to illustrate appropriate statistical methods for
measuring length and for adjusting measures for the effect of size.
Estimation cannot be done independently of the case at hand; it must be
based upon results obtained in the past. In [DeMarco 82, pp. 6-7], DeMarco
asserted this principle of measurement:

"  Measurement  is  always  a  recording  of  past  effects.   The  uses  we  will 
want  to  make  of  our  measurement  nearly  always  involve  some  predictive 
quantification  of  future  effect.  ...  the  estimating  function  is  based 
rigorously  on  statistics  collected  from  past  activities." 

In the real world, the various specifications, programming tools, and even
the personnel involved in a project are all factors that influence the
parameter values in the model. The key point is that a model should utilize
the past record efficiently and accurately, and from it develop more
reliable parameters for estimating the software length.

Four models were proposed based upon the idea of linear regression
modeling, and the data sets were transformed in order to meet the
statistical requirements. Together with four other models proposed in
various articles, [Halstead 72], [Fitsos 80], [Albrecht 83] and [Jensen 85],
eight models were analyzed and compared. Not only was correlation analysis
performed between the estimated and the actual length, but the mean relative
error and the counts of over/under estimation were also employed in the
comparison. Models L+ and L*, proposed by the author, were more precise in
estimating length than the other models. They provided smaller MREs and
balanced over/under estimating counts.

Future Work

In this report, the logarithmic transformation was employed to transform
the data set. There are other ways to transform the data, but that analysis
is left for future work. More attention is still needed in analyzing the
trend of the MRE behavior in some models, such as H, J, and A*. These
provide a moderate MRE; however, they show a trend of bias depending upon
whether the actual program length increases or decreases. The modeling
methods introduced in this report could also be useful in estimating other
program characteristics, which deserves more work in the future.

Token Counts Of UNIX Source Programs

            25   135    64   199
     7    87   192   159   351
    15     9    41    17    58
    25    18    69    33   102
    43    35   194    80   274
    43    31   233    98   331
   122   289  3135  1945  5080
    47    40   262   116   378
    59    59   472   238   710
    79    91   843   389  1232
    51    55   418   185   603
    91   136  1237   646  1883
    72    78   543   218   761


   120   163  1587   747  2334
    75   185  2220  1145  3365
    67    67   495   212   707
    98   106   920   408  1328
    35    17   117    50   167
    74   113  1341   705  2046
    36    26   187    71   258
    74    85   824   385  1209
    90   133  1231   738  1969
    SO   111  1009   493  1502
    52    58   560   302   862
   159   314  3988  1829  5817
   122   186  2084  1019  3103
   173   266  2619  1227  3846
    47    40   254   108   362
    54    64   351   159   510
    72    95   693   371  1064
    31    25   177    84   261
    28    20   113    56   169
   157   372  4227  2054  6281
    86   145  1589   769  2358
    50    71   521   278   799
    48    42   299   144   443
    37    36   243   106   349
    78    94  1165   507  1672
    54    69   645   272   917
    58    81   870   394  1264
    65   117   858   397  1255
    33    23   224   105   329
    56    75   650   277   927
    73    61   509   203   712
    50    50   386   156   542
    88   219  2080  1247  3327
   112   282  3000  1688  4688
    98   171  1976  1158  3134
    76   107  1612   990  2602
   121   298  4278  2495  6773
    85   131   836   377  1213
   131   257  2737  1244  3981
    88   134  1040   576  1616
   162   278  2631  1288  3919
   154   298  3380  1655  5035
    23    13    47    19    66
   134   290  3328  1430  4758
   146   218  2078   959  3037
   100    93   728   320  1048
    90   185  1808   929  2737
    94   153  1736   784  2520
    48    35   244   107   351
    58    48   402   173   575
    35    32   224   113   337


    68   121  1100   513  1613
    15     5    20     7    27
    74   160  1780   953  2733
    31    35   168    87   255
    30    21   256   102   358
    17     6    29    11    40
    22     8    44    18    62
    24    19    98    57   155
    18     7    51    20    71
    47    63   418   197   615
    24    13    66    32    98
    30    27   127    52   179
    22    12    49    18    67
    16     7    25     8    33
    17     6    33    12    45
    21    10    76    40   116
    61    41   422   176   598
    12     3    20     3    23
    22     9    45    19    64
    22     9    45    19    64
    50   104   788   383  1171
    54    99  1216   583  1799
    26    20    93    48   141
    19     9    55    23    78
    14     5    27    12    39
    21    10    49    20    69
    31    32   147    64   211
    24    10    74    34   108
    15     3    24     6    30
    22     9    57    26    83
    19     9    65    28    93
    25    15   109    44   153
    10     3    12     4    16
    27    14    77    37   114
    19     6    43    16    59
    53    35   427   206   633
    29    12    78    27   105
    24    15    64    29    93
    29    21   116    44   160
    24    15    83    35   118
    17     6    32    11    43
    14     5    23     7    30
    21     9    40    18    58
    18     6    34    13    47
    16     6    34    13    47
    27    13    75    31   106
    39    18   149    50   199
    20    12    44    25    69
    51    44   336   174   510
    26    22   108    54   162
    36    22   147    65   212


    27    18   152    59   211
    65    82   536   302   838
    43    31   187    88   275
    76    91   711   309  1020
    IK    17    67    38   105
    41    27   288   161   449
    19    12    42    26    68
    27    12    63    26    89
    IS     9    30    13    43
    34    24   141    80   221
    22    10    49    16    65
    20    14    52    25    77
    57    42   392   174   566
    42    31   236   114   350
    50    42   291   142   433
    58    70   572   322   894
    36    45   231   146   377
    24    11    80    31   111
    42    27   149    62   211
    22    12    65    27    92
    26    16    66    41   107
    27    15    79    32   111
    35    16   185    71   256
    29    12    66    25    91
    30    22   110    45   155
    85   134  1346   748  2094
    32    19   232   147   379
    45    67   271   148   419
    51    68   678   308   986
    84   117   981   506  1487
   126   225  2242  1031  3273
    70    52   650   281   931
    98   170  1510   797  2307
    77    99   549   252   801
     7    47    81    57   138
    88    94   951   423  1374
    15   119   425   234   659
    49    68   325   160   485
   137   313  3418  1723  5141
    56   100   880   434  1314
   112   331  3234  1833  5067
    70   110   785   518  1303
   120   162  1667   743  2410
   156   379  5117  2824  7941
    27   141   239   188   427
     9     6    23     6    29
   114   221  2010  1039  3049
    59    93   745   383  1128
     7     7    12     7    19
    11    13    32    24    56
    93   161  1684   728  2412


    72   215  2575  1271  3846
    70   164  1707   843  2550
    35    35   171    93   264
   157   297  3898  1672  5570
    80   145  1049   645  1694
    51    55   322   171   493
    59    68   338   186   524
    42    43   222    88   310
    64    78   852   497  1349
    38    32   136    69   205
    10    10    20    10    30
    86   216  1509   926  2435
    90   180  1368   631  1999
    80   138   953   418  1371
    87   130   999   440  1439
    25    11    69    26    95
    62    77   558   294   852
     8     5     9     5    14
     8     4     9     4    13
    55    58   367   214   581
    71   119   556   317   873
     9    16    41    26    67
    59   196  1977  1282  3259
    71   117   552   315   867
     9    16    41    26    67
    59   198  2003  1298  3301
   140   372  3825  1825  5650
    31    19   154    63   217
    28    23   125    48   173
    45    48   297   152   449
    61    95   875   340  1215
    98   183  2373  1490  3863
    69   256  1568   811  2379
    44   108   653   436  1089
    36    46   184   112   296
    76   124  1349   812  2161
    26   108   242   164   406
    11    25    50    47    97
   100   225  1672   784  2456
    89   138   956   367  1323
   112   298  3261  1860  5121
   103   305  3152  2134  5286
    35    59  1400   790  2190
    42    82   334   251   585
   134   400  3663  2315  5978
    61   101   937   637  1574
    11    30    63    30    93
    69    82   543   293   836
   114   259  2647  1591  4238
   101   185  1747  1013  2760
    84   108   720   348  1068


     6     4    11     4    15
    91   177  1047   626  1673
    25    19    84    47   131
    22    15    82    42   124
    21     9    51    14    65
    51    40   318   136   454
    32    21   187    73   260
    46    30   320   127   447
    23    17    65    25    90
    47    37   241    97   338
    46    50   451   250   701
    26    18    64    27    91
    24    18    64    27    91
    24    18    64    27    91
    35    27   128    57   185
    24    17    64    27    91
    25    14    57    19    76
    55    43   460   180   640
    26    18    64    27    91
    32    22    90    37   127
    26    17    64    27    91
    28    14    66    24    90
    30    20    87    35   122
    24    14    58    21    79
    25     8    47    16    63
    23     8    41    15    56
    20    15    40    25    71
    48    68   358   193   551
    37    3S   341   104   445
    40    21   175    52   227
    59    69   530   221   751
    59   114  1261   674  1935
    48    65   394   173   567
    36    25    98    40   138
    55    54   379   140   519
    46    50   227   106   333
    99   174  2323  1146  3469
    34    38   362   147   509
    29    29   138    68   206
    35    31   262    88   350
    44    43   492   168   660
    66    69   562   204   766
    71    81   723   301  1024
    38    34   215    90   305
   109   279  3661  1900  5561
    55    72   643   307   950
    57    38   500   242   742
    69   183  1457   670  2127
     8    34   266   259   525
    11    55   109    61   170
    40    29   197    68   265


    44    30   155    63   218
    64    73   734   341  1075
    45    46   347   181   528
    51   111   885   380  1265
    51    32   309   154   463
    81    79   937   457  1394
    86    81   677   362  1039
    86    90   789   335  1124
   102   117   759   346  1105
    15   243   569   253   822
   109   120  1362   614  1976
    62    51   468   216   684
    33    16   112    68   180
    57    51   596   279   875
   129   174  1462   679  2141
    13     4    25     8    33
    37    20   172    61   233
    25    13   139    50   189
    56    75   822   279  1101
    64    61   553   260   813
   167   235  2047   927  2974
    22    13    95    56   151
     9    18    31    19    50
    58    32   384   149   533
    80    91   686   301   987
    55    49   323   130   453
    43    26   198    96   294
    34    22   130    67   197
    28    15    81    26   107
    39    27   142    53   195
    26    15    66    32    98
    85   104  1525   759  2284
    45    90  1539   896  2435
    36    43   721   482  1203
    48    70   829   491  1320
   130   501  5172  2940  8112
    35    82   257   177   434
    50    23   234   110   344
    67    41   443   215   658
    19     9    33    11    44
    57    42   377   229   606
    65    38   692   364  1056
    64    37   398   171   569
    93   247  2213  1008  3221
    67    88   984   599  1583
    45    52   296   145   441
    72   438  3021  1697  4718
    80    96   617   278   895
    59    43   282   148   430
   101   323  2910  1539  4449
    88   257  1813   951  2764


    60    57   554   299   853
   146   259  3156  1494  4650
    55    51   244   130   374
    51    53   289   140   429
    69    80   586   281   867
    37    78  1682  1621  3303
    66    60   509   233   742
    96   143  1467   743  2210
    41   368  2553   951  3504
    53    55   541   325   866
   138   238  4475  2280  6755
   132   257  2143   993  3136
   104   107  1221   638  1859
    52    39   589   313   902
    33    23   114    47   161
    58    48   328   148   476
    35    34   224    90   314
    59    39   421   172   593
    78   134  1373   771  2144
    14     8    28    13    41
    82   111   976   518  1494
    78    78   641   314   955
    19    11    66    29    95
    35    30   127    56   183
    66    80   829   382  1211
   132   250  2059   890  2949
   159   358  4066  1897  5963
   133   301  3154  1522  4676
    42    29   194    80   274
    55    47   403   172   575
    36    24   212    81   293
    49    32   180    82   262
    36    40   298   118   416
    30    31   362   156   518
    23    14   107    46   153
    72    56   587   287   874
    31    18    81    37   118
    58    44   378   184   562
    56    44   280   127   407
    45    39   215    96   311
    22    13    72    27    99
    37    26   174    80   254
    57    96   643   380  1023
    46    73   424   237   661
    38    28   148    63   211
    86   149  1606  1003  2609
    86   135  1067   590  1657
    12    35    56    43    99
    11     3    24    10    34
    19     7    39    16    55
    65    73   463   217   680


    56    53   399   166   565
    11     4    14     8    22
    52    48   316   148   464
    50    42   206    89   295
    65    57   495   224   719
    45    40   162    77   239
    58    41   236   106   342
    55    37   208    96   304
   105   169  1286   627  1913
   118   232  2364  1100  3464
    13    35    71    44   115
    26    12    56    28    84
    63    84   552   254   806
    43    41   285   132   417
   125   258  2286  1121  3407
    93   172  1546   748  2294
   116   216  2821  1357  4178
   129   234  2121   991  3112
    32    24   145    64   209
    66    46   465   210   675
    34    31   248    94   342
   114   161  1879   866  2745
    50    29   247   111   358
    89    68   749   325  1074
   105   112  1323   532  1855
    97   125   726   349  1075
   114   146  2408  1349  3757
   122   196  1506   721  2227
   158   244  2210  1038  3248
   116   169  1690   782  2472
    23    10    65    22    87
    87   156   858   431  1289
    72    55   562   229   791
   175   205  2203   651  2854
   135   126  1021   368  1389
   181   266  3195  1576  4771
    13   122   499   343   842
     8     9    19     9    28
    84    69   697   305  1002
   152   240  2261   989  3250
   130   180  2070  1016  3086
   124   200  2436  1072  3508
    80    68   621   237   858
   187   225  2196   907  3103
   150   202  2315  1214  3529
    72   199   872   492  1364
   101   108   876   376  1252
   144   164  1901   916  2817
   130   162  1766   884  2650
    75    78  1313   599  1912
   125   161  2628  1273  3901


    96    85  1009   516  1525
    93   102   770   378  1148
   124   185  1656   813  2469
    31    16    85    37   122
    55    84   756   372  1128
   101   178  2070  1251  3321
   123   302  3149  1631  4780
    44    37   274   103   377
    33    35   248   139   387
    34    32   179    84   263
    38    38   249   108   357
    79    96  1169   507  1676
    55    55   522   235   757
    57    80   829   376  1205
    66   119   864   399  1263
    34    25   230   107   337
    48    42   247   108   355
    42    56   480   248   728
    70    69   737   370  1107
    67   112  1438   859  2297
    41    31   188    88   276
    73   146  1377   642  2019
    62    81   690   326  1016
    31    30   137    56   193
    53    66   410   194   604
    69   118   738   315  1053
    92   103   820   364  1184
    50    40   306   146   452
   121   196  2037   971  3008
    89   204  1526   715  2241
    81    92   854   363  1217
    73   131  1324   624  1948
    94   185  2149  1209  3358
    94   278  3056  1497  4553
    76   121  1216   712  1928
    60    92  1007   592  1599

Appendix B


Token Counts Of CMPSC 541 Programs


   η1     η2     N1     N2      N

   32     22    117     72    189
   32     22    127     79    206
   32     22    210    136    346
   29     21    418    195    613
   97    219   4568   2794   7362
   31     27    186     82    268
   46     48    828    583   1411
   28     32    265    189    454
   56    131   1332    788   2120
   43     56   2764   2023   4787
   41     75    620    367    987
   31     21    146     62    208
   33     36    235    131    366
   24     32    171    124    295
   29     22     76     41    117
   32     37    177     89    266
   22     25     99     74    173
   16     28    477    314    791
   22     27    163    124    287
   26     19    110     54    164
   70    433   3684   3039   6723
   35     56    353    187    540
   25     20     82     31    113
   18     13    242    157    399
   22     31    195    132    327
   19     11     85     32    117
   22     25     99     74    173
   29     60    309    186    495
   55     96   1304    584   1888
   40     34    272    153    425
   40     39    315    179    494
   20     47    485    330    815
   26     22    237    163    400
   76     99   1555    853   2408
   29     13    104     54    158
   37     32    207     88    295
   32     32    384    226    610
   22     39    400    300    700
   17     11    160    110    270
   31     28
Description of Modules

Four modules were designed to count and classify the tokens of programs
written in the C language. The four modules are used together in the
following script:

% p1 < ProgramName | p2 | sort | p3 | p4 >> ResultsFileName

Each run of the script appends one line of results to "ResultsFileName".
After the script has been run for a series of different "ProgramName"
inputs, all the results are recorded in "ResultsFileName", which serves as
the data set for further analysis.
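
Running the script over a series of programs can be automated. The Python below is only a sketch of such a driver, not part of the original tooling: the function names `pipeline_command` and `run_all` are hypothetical, and it assumes the four compiled modules are installed on the search path as `p1`, `p2`, `p3` and `p4`.

```python
import shlex
import subprocess


def pipeline_command(program, results):
    """Build the shell command that appends one result line for `program`."""
    return "p1 < {} | p2 | sort | p3 | p4 >> {}".format(
        shlex.quote(program), shlex.quote(results))


def run_all(programs, results):
    """Run the pipeline once per program (assumes p1..p4 are on the PATH)."""
    for program in programs:
        subprocess.run(pipeline_command(program, results), shell=True, check=True)
```

Calling `run_all(["a.c", "b.c"], "results.txt")` would append two lines to `results.txt`, one per program, in the order the programs are listed.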

Module-1  retrieves all the tokens (or pieces) and comments from the input
program. The output is written line by line, each line holding a
single token or comment symbol, so that it can be processed by
module-2.

Module-2  screens out the comment strings from the list and merges pieces
that should not have been separated into a single token. For
example, in the preprocessor section of a program, "#" and
"include" are merged into "#include" to represent a single token.
This module also marks particular tokens with symbols so that they
can be easily recognized and classified in module-4. For example,
a token that is followed by a parenthesis is marked with "*",
meaning that the token is an operator. As another example, any
token or string enclosed in quotation marks is labeled with "#",
meaning that the token is an operand. The output must be sorted
before it is used by module-3.

Module-3  counts the occurrences of identical tokens or strings. Each line
of the output gives the number of occurrences followed by the
corresponding token (or string). The output is passed directly to
module-4.

Module-4  classifies the tokens as operators or operands. The file
"keywords" serves as a library: any token in this list is treated
as an operator. Any token whose last character is "*" is an
operator, and any token whose last character is "#" is an operand.
All numeric constants, whether decimal, hexadecimal or octal, are
treated as operands. The output gives the counts of distinct
operators and distinct operands and the total numbers of operators
and operands.
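
The counting and classification rules of module-3 and module-4 can be sketched compactly. The Python below is an illustration under stated assumptions, not the report's Pascal implementation: the names `is_operand` and `count_tokens` are hypothetical, and the small `KEYWORDS` set is a stand-in for the "keywords" file, whose contents are not listed here. The input is a list of tokens already marked by module-2, and the result uses the column order of the appendix tables (η1, η2, N1, N2, N).

```python
import string
from collections import Counter

# Hypothetical stand-in for the "keywords" file (its real contents are not listed).
KEYWORDS = {"if", "else", "while", "for", "return", "int", "char"}


def is_operand(token):
    """Apply module-4's rules to one marked token."""
    first, last = token[0], token[-1]
    if first in string.digits or (first == "." and len(token) > 1):
        return True                       # numeric constant
    if first in string.ascii_letters or first == "_":
        if last == "*":
            return False                  # name followed by "(": operator
        if last == "#":
            return True                   # text taken from inside quotation marks
        return token not in KEYWORDS      # keywords are operators, other names operands
    return last == "#"                    # quoted text is an operand, symbols are operators


def count_tokens(tokens):
    """Return (eta1, eta2, N1, N2, N) for a list of marked tokens."""
    eta1 = eta2 = n1 = n2 = 0
    for token, occurrences in Counter(tokens).items():  # module-3: count identical tokens
        if is_operand(token):
            eta2 += 1
            n2 += occurrences
        else:
            eta1 += 1
            n1 += occurrences
    return eta1, eta2, n1, n2, n1 + n2


# Marked tokens for the C fragment `if (x == 0) return f(x);`
# (closing parentheses are dropped by module-1).
sample = ["if*", "(", "x", "==", "0", "return", "f*", "(", "x", ";"]
print(count_tokens(sample))  # (6, 2, 7, 3, 10)
```

`Counter` replaces the sort-then-count arrangement of the script; the original needs sorted input only because module-3 compares each line with its predecessor.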




Module  1 


program ctoken(input, output);
type stringtype = array[1..80] of char;
var c: char; getastring: boolean; i, k: integer;
    stringvar: stringtype;

(* read and return the next input character *)
function getchar: char;
var x: char;
begin
  read(x);
  getchar := x
end;

function alph(ch: char): boolean;
begin
  if (ch in ['a'..'z']) or (ch in ['A'..'Z']) or (ch='_') then
    alph := true
  else alph := false
end;

function alphnum(ch: char): boolean;
begin
  if (alph(ch)) or (ch in ['0'..'9']) then alphnum := true
  else alphnum := false
end;

(* skip the body of a comment and write a fixed marker in its place *)
procedure iscomment;
var done: boolean;
begin
  done := false; c := getchar;
  while (not done) do
  begin
    if (c='*') then begin c := getchar; if (c='/') then done := true end
    else c := getchar
  end;
  write('* this comment */')
end;

function isblank(st: stringtype; i: integer): boolean;
var b: boolean; k: integer;
begin
  b := true;
  for k := 1 to i do
    begin if (st[k]<>' ') then b := false end;
  isblank := b
end; (* of isblank function *)

(* write the current character and read the next one *)
procedure goahead;
begin
  write(c); c := getchar
end;

(* write a printf-style format item that starts at '%' *)
procedure formatwrite(cr: char);
begin
  repeat
    goahead;
  until (c in ['d','u','o','x','X','f','e','E','g','R','G','c','s','%'])
        or (c=cr);
  if (c in ['d','u','o','x','X','f','e','E','g','R','G','c','s','%'])
    then goahead
end;

(* main program *)

begin
  c := getchar;
  while (not eof) do
  begin
    if (c=' ') then c := getchar
    else if (c='#') then begin goahead; writeln end
    else if (alph(c)) then
      begin
        goahead;
        while (alphnum(c)) do goahead;
        writeln
      end
    else if (c='/') then
      begin goahead;
        if (c='*') then begin iscomment; c := getchar end;
        if (c='=') then goahead;
        writeln
      end
    else if (c='!') then
      begin goahead;
        if (c='=') then goahead;
        writeln
      end
    else if (c='%') then
      begin goahead;
        if (c='=') then goahead;
        writeln
      end
    else if (c='&') then
      begin goahead;
        if (c='&') or (c='=') then goahead;
        writeln
      end
    else if (c='(') then begin goahead; writeln end
    else if (c=')') then c := getchar
    else if (c='*') then
      begin goahead;
        if (c='=') then goahead;
        writeln
      end
    else if (c='+') then
      begin goahead;
        if (c='+') or (c='=') then goahead;
        writeln
      end
    else if (c=',') then begin goahead; writeln end
    else if (c='-') then
      begin goahead;
        if (c='-') or (c='=') or (c='>') then goahead;
        writeln
      end
    else if (c=':') then begin goahead; writeln end
    else if (c=';') then begin goahead; writeln end
    else if (c='<') then
      begin goahead;
        if (c='<') then
          begin goahead; if (c='=') then goahead end;
        if (c='=') then goahead;
        writeln
      end
    else if (c='=') then
      begin goahead;
        if (c='=') then goahead;
        writeln
      end
    else if (c='>') then
      begin goahead;
        if (c='>') then
          begin goahead; if (c='=') then goahead end;
        if (c='=') then goahead;
        writeln
      end
    else if (c='?') then begin goahead; writeln end
    else if (c='[') then begin goahead; writeln end
    else if (c=']') then c := getchar
    else if (c='{') then begin goahead; writeln end
    else if (c='}') then c := getchar
    else if (c='~') then begin goahead; writeln end
    else if (c='^') then
      begin goahead;
        if (c='=') then goahead;
        writeln
      end
    else if (c='|') then
      begin goahead;
        if (c='=') or (c='|') then goahead;
        writeln
      end
    else if (c='"') then            (* string constant *)
      begin goahead; writeln;
        getastring := false;
        i := 1;
        while (c<>'"') do
        begin
          if (c='%') then begin formatwrite('"'); writeln end
          else if (c='\') then      (* escape sequence *)
            begin goahead;
              if (c in ['0'..'9']) then
                repeat goahead until (c < '0') or (c > '9')
              else goahead; writeln
            end
          else begin getastring := true;
            if (i<80) then stringvar[i] := c;
            i := i+1;
            c := getchar end
        end;
        i := i-1;
        if (getastring) then begin
          if (isblank(stringvar,i)) then begin
            writeln('blank-string',i:2) end
          else begin
            if i>70 then i := 70;
            for k := 1 to i do write(stringvar[k]); write('#');
            writeln; end;
        end;
        writeln('#"');
        c := getchar
      end
    else if (c='''') then           (* character constant *)
      begin goahead; writeln;
        getastring := false;
        i := 1;
        while (c<>'''') do
        begin
          if (c='%') then begin formatwrite(''''); writeln end
          else if (c='\') then
            begin goahead;
              if (c in ['0'..'9']) then
                repeat goahead until (c < '0') or (c > '9')
              else goahead; writeln
            end
          else begin getastring := true;
            if (i<80) then stringvar[i] := c;
            i := i+1;
            c := getchar end
        end;
        i := i-1;
        if (getastring) then begin
          if (isblank(stringvar,i)) then begin
            writeln('blank-string',i:2) end
          else begin
            if i>70 then i := 70;
            for k := 1 to i do write(stringvar[k]); write('#');
            writeln; end;
        end;
        writeln('#"');
        c := getchar
      end
    else if (c='.') then            (* '.' or a constant such as .5 *)
      begin goahead;
        if (c in ['0'..'9']) then
        begin
          goahead;
          while (c in ['0'..'9']) do goahead;
          if (c='E') or (c='e') then
          begin goahead; goahead;
            while (c in ['0'..'9']) do goahead;
            if (c='L') or (c='l') then goahead;
          end
        end;
        writeln
      end
    else if (c='0') then            (* hexadecimal or octal constant *)
      begin goahead;
        if (c in ['x','X']) or (c in ['0'..'7']) then
        begin
          if (c = 'X') or (c = 'x') then
          begin
            goahead;
            while (c in ['0'..'9']) or (c in ['a'..'f']) or
                  (c in ['A'..'F']) do goahead;
            if (c='L') or (c='l') then goahead
          end
          else begin
            goahead;
            while (c in ['0'..'7']) do goahead;
            if (c='L') or (c='l') then goahead
          end;
          writeln
        end
        else if (c<>'.') then writeln
        else
      end
    else if (c in ['1'..'9']) then  (* decimal constant *)
      begin goahead;
        if (c in ['0'..'9']) or (c='.') then
        begin
          goahead;
          while (c in ['0'..'9']) do goahead;
          if (c='.') then
          begin goahead;
            if (c in ['0'..'9']) then
            begin goahead;
              while (c in ['0'..'9']) do goahead;
              if (c='E') or (c='e') then
              begin goahead; goahead;
                while (c in ['0'..'9']) do goahead;
                if (c='L') or (c='l') then goahead;
              end
            end
          end;
          if (c='E') or (c='e') then
          begin goahead; goahead;
            while (c in ['0'..'9']) do goahead;
            if (c='L') or (c='l') then goahead;
          end
        end;
        if (c='L') or (c='l') then goahead;
        writeln
      end
    else c := getchar
  end;

  writeln('}the end')
end.


Module  2 


program p2(input, output);
type stringtype = record
       content: array[1..80] of char;
       count  : integer
     end;
var s: stringtype;
    n: integer;

(* read one input line into s *)
procedure get(var s: stringtype);
var c: char; i: integer;
begin
  i := 1;
  repeat
    begin read(c); s.content[i] := c; i := i + 1 end
  until (eoln) or (eof);
  s.count := i-1; read(c)
end;

procedure put(s: stringtype);
var i: integer;
begin for i := 1 to s.count do write(s.content[i]) end;


function verify(s: stringtype): integer;
(* EOF=0; identifier=1; preprocessor=2; comment=3; equal=5; else=4 *)
var c, last: char;
begin
  c := s.content[1]; verify := 4; last := s.content[s.count];
  if (c='}') and (last<>'#') and
     (s.count=8) and (s.content[5]=' ') then verify := 0;
  if ((c in ['a'..'z']) or (c in ['A'..'Z']) or (c='_')) and
     (last <> '#') then verify := 1;
  if (c='#') and (s.count=1) then verify := 2;
  if (c='#') and (s.count=2) and (s.content[s.count]='"') then
     verify := 3;
  if (c='/') and (s.count>10) and (last<>'#') then verify := 3;
  if (c='=') and (s.count=1) then verify := 5;
end;

begin
  get(s); n := verify(s);
  while (n <> 0) do
  begin
    if (n=3) then get(s)
    else if (n=1) then
      begin
        put(s); get(s);
        (* an identifier followed by "(" is marked as an operator *)
        if (s.content[1]='(') and (s.count=1) then write('*');
        writeln
      end
    else if (n=2) then (* preprocessor # *)
      begin
        put(s); get(s);
        if (s.content[1] in ['a'..'z']) or
           (s.content[1] in ['A'..'Z']) or (s.content[1]='_') then
        begin
          put(s); writeln; get(s);
          if (s.content[1] = '<') then
          begin
            put(s); write('>'); writeln; get(s);
            while (s.content[1]<>'>') do begin put(s); get(s) end;
            writeln;
            get(s)
          end
        end
        else writeln;
      end
    else if (n=5) then
      begin
        put(s); writeln; get(s);
        if (s.count=1) then
        begin if (s.content[1]='"') or (s.content[1]='''') then
          begin put(s); writeln; get(s);
            while (s.count<>2) or (s.content[1]<>'#') or
                  (s.content[2]<>'"') do begin
              put(s); get(s) end;
            write('#');
            writeln
          end
        end
      end
    else begin put(s); get(s); writeln end;
    n := verify(s)
  end
end.
Module 3


program p3(input, output);
type stringtype = record
       content: array[1..80] of char;
       count  : integer
     end;
var s1, s2: stringtype; n: integer;

(* read one input line into s *)
procedure get(var s: stringtype);
var c: char; i: integer;
begin
  i := 1;
  repeat
    begin read(c); s.content[i] := c; i := i + 1 end
  until (eoln) or (eof);
  s.count := i-1; read(c)
end;

procedure put(s: stringtype);
var i: integer;
begin for i := 1 to s.count do write(s.content[i]) end;

function compare(s1, s2: stringtype): boolean;
var i: integer;
begin
  if (s1.count=s2.count) then
  begin
    compare := true;
    for i := 1 to s1.count do
      if (s1.content[i]<>s2.content[i]) then compare := false
  end
  else compare := false
end;

(* the input is sorted, so identical tokens are adjacent *)
begin
  n := 1;
  get(s1);
  while (not eof) do
  begin
    get(s2);
    if (compare(s1,s2)) then n := n+1
    else begin write(n:5,'   '); put(s1); writeln; s1 := s2; n := 1 end;
    if (eof) then begin write(n:5,'   '); put(s1); writeln; s1 := s2; n := 1 end;
  end
end.
Module 4


program p4(input, output);
const nkey = 33;
type stringtype = record
       content: array[1..80] of char;
       count  : integer
     end;
var
  keyfile: text;
  s: stringtype;
  key: array[1..nkey] of stringtype;
  i, r, rn, d, dn, n, nrd: integer;
  iskey: boolean;
  c: char;

(* read one input line into s *)
procedure get(var s: stringtype);
var c: char; i: integer;
begin
  i := 1;
  repeat
    begin read(c); s.content[i] := c; i := i + 1 end
  until (eoln) or (eof);
  s.count := i-1; read(c)
end;

(* convert the 5-character count field written by module-3 to an integer *)
function conv(s: stringtype): integer;
var i, k: integer;
begin
  k := 0;
  for i := 1 to 5 do
  begin
    if (s.content[i]<>' ') then
    begin
      if (i=5) then k := k + ord(s.content[i])-48
      else if (i=4) then k := k+10*(ord(s.content[i])-48)
      else if (i=3) then k := k+100*(ord(s.content[i])-48)
      else if (i=2) then k := k+1000*(ord(s.content[i])-48)
      else k := k+10000*(ord(s.content[i])-48)
    end
  end;
  conv := k
end;

procedure getkey(var s: stringtype);
var c: char; i: integer;
begin
  i := 1;
  repeat
    begin read(keyfile,c); s.content[i] := c; i := i + 1 end
  until (eoln(keyfile)) or (eof(keyfile));
  s.count := i-1; read(keyfile,c)
end;

(* compare keyword s1 with the token in s2, which starts at column 9 *)
function compare(s1, s2: stringtype): boolean;
var i: integer;
begin
  if (s1.count=s2.count-8) then
  begin
    compare := true;
    for i := 1 to s1.count do
      if (s1.content[i]<>s2.content[i+8]) then compare := false
  end
  else compare := false
end;

(* r, rn: distinct and total operators; d, dn: distinct and total operands *)
begin
  r := 0; rn := 0; d := 0; dn := 0; n := 0;
  reset(keyfile,'keywords');

  for i := 1 to nkey do begin getkey(s); key[i] := s end;
  repeat
    begin
      get(s); c := s.content[9];
      if (c in ['0'..'9']) or (c='"') then
        begin d := d+1; dn := dn+conv(s) end
      else if (c='.') and (s.count>9) then
        begin d := d+1; dn := dn+conv(s) end
      else if (c in ['a'..'z']) or (c in ['A'..'Z']) or (c='_') then
        begin if (s.content[s.count]='*') then
            begin r := r+1; rn := rn+conv(s) end
          else if (s.content[s.count]='#') then
            begin d := d+1; dn := dn+conv(s) end
          else begin
            iskey := false;
            for i := 1 to nkey do
              if (compare(key[i],s)) then iskey := true;
            if (iskey) then begin r := r+1; rn := rn+conv(s) end
            else begin d := d+1; dn := dn+conv(s) end
          end
        end
      else begin if (s.content[s.count]='#') then
            begin d := d+1; dn := dn+conv(s) end
          else begin r := r+1; rn := rn+conv(s) end
        end
    end;
  until (eof);
  n := dn+rn;
  nrd := d+r;

  writeln(r:8,d:8,nrd:8,rn:8,dn:8,n:8);
end.
ABSTRACT

This report discusses the existing software length estimation models
suggested by Halstead, Fitsos, Albrecht and Jensen. Three data sets (UNIX
source programs, CMPSC 541 programs, and Pascal programs) are used to
develop new models and to investigate the characteristics of the models.
The raw data sets are normalized by a logarithmic transformation so that the
transformed data sets satisfy the requirements of the subsequent statistical
modeling procedures. In this report, the author proposes four models, which
are developed using linear and nonlinear modeling methods.

Experiments are conducted to compare the accuracy of the models. All the
models show a high correlation between the estimated length and the actual
length. Correlation analysis alone, however, is not sufficient to show the
superiority or inferiority of one model over another; therefore, the mean
relative error (MRE) and the counts of overestimates and underestimates
were used for further comparison. The results show that the models derived
from linear modeling provide more accurate estimates than any of the other
models, having not only a lower MRE but also balanced counts of over- and
underestimates. Halstead's model tends to overestimate small programs and
underestimate large ones. Fitsos's and Albrecht's models are suitable for
large programs but not for small ones, and Albrecht's model is much more
accurate than Fitsos's. Jensen's model, with its simple structure,
accurately estimates programs that are not large, but its error increases
as the program size grows. The models developed from nonlinear modeling
provide moderate accuracy.
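
The comparison criteria named above can be illustrated with a short computation. The Python below is a sketch under stated assumptions rather than the procedure used in the report: it takes the MRE to be the mean of |estimate - actual| / actual, the function names are hypothetical, and only Halstead's well-known estimator, η1 log2 η1 + η2 log2 η2, is shown. The three sample rows (η1, η2, N) are taken from the Appendix B table.

```python
import math


def halstead_length(eta1, eta2):
    """Halstead's length estimator from the unique operator and operand counts."""
    return eta1 * math.log2(eta1) + eta2 * math.log2(eta2)


def evaluate(rows, estimator):
    """Return (MRE, overestimates, underestimates) for (eta1, eta2, N) rows.

    Assumes MRE = mean(|estimate - actual| / actual).
    """
    errors, over, under = [], 0, 0
    for eta1, eta2, actual in rows:
        estimate = estimator(eta1, eta2)
        errors.append(abs(estimate - actual) / actual)
        if estimate > actual:
            over += 1
        elif estimate < actual:
            under += 1
    return sum(errors) / len(errors), over, under


# (eta1, eta2, N) for three CMPSC 541 programs from Appendix B.
rows = [(32, 22, 189), (97, 219, 7362), (56, 131, 2120)]
mre, over, under = evaluate(rows, halstead_length)
print(round(mre, 3), over, under)
```

On these three rows the estimator is high for the smallest program and low for the two larger ones, the pattern described for Halstead's model; three rows are far too few to support any conclusion on their own.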


